Robustness of Threshold-Based Feature Rankers with Data Sampling on Noisy and Imbalanced Data

Authors

  • Ahmad Abu Shanab
  • Taghi M. Khoshgoftaar
  • Randall Wald
Abstract

Gene selection has become a vital component of the learning process when using high-dimensional gene expression data. Although extensive research has evaluated the performance of classifiers trained with the selected features, the stability of feature ranking techniques has received relatively little study. This work evaluates the robustness of eleven threshold-based feature selection techniques, examining the impact of data sampling and class noise on the stability of feature selection. To assess the robustness of these techniques, we use four groups of gene expression datasets, employ eleven threshold-based feature rankers, and generate artificial class noise to better simulate real-world datasets. The results demonstrate that although no ranker consistently outperforms the others, MI and Dev show the best stability on average, while GI and PR show the least stability on average. Results also show that trying to balance datasets through data sampling has, on average, no positive impact on the stability of feature ranking techniques applied to those datasets. In addition, increasing the feature subset size improves stability, but does so reliably only for noisy datasets.
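The kind of stability experiment the abstract describes can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the ranker (absolute difference of class means) is a simple stand-in rather than one of the eleven threshold-based rankers, the noise injection flips labels uniformly at random, and stability is measured as the Jaccard overlap of top-k feature subsets, which may differ from the measure used in the paper. Data sampling is not modeled here.

```python
import random

def rank_features(X, y):
    """Rank features by |mean(class 1) - mean(class 0)| (a stand-in ranker)."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        scores.append(abs(mean_pos - mean_neg))
    # Feature indices sorted from highest to lowest score.
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

def inject_class_noise(y, rate, rng):
    """Flip the labels of a randomly chosen fraction `rate` of instances."""
    noisy = list(y)
    n_flip = int(round(rate * len(y)))
    for i in rng.sample(range(len(y)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy

def top_k_overlap(rank_a, rank_b, k):
    """Jaccard overlap of two top-k feature subsets (1.0 means identical)."""
    a, b = set(rank_a[:k]), set(rank_b[:k])
    return len(a & b) / len(a | b)

# Toy imbalanced data (hypothetical): 40 negatives, 20 positives, 10 features.
# Only feature 0 carries class information.
rng = random.Random(0)
y = [0] * 40 + [1] * 20
X = [[rng.gauss(3.0 * label if j == 0 else 0.0, 1.0) for j in range(10)]
     for label in y]

clean_rank = rank_features(X, y)
noisy_y = inject_class_noise(y, 0.1, rng)   # 10% class noise
noisy_rank = rank_features(X, noisy_y)
stability = top_k_overlap(clean_rank, noisy_rank, k=3)
print(f"top-3 overlap between clean and noisy rankings: {stability:.2f}")
```

Repeating the comparison for several values of `k` and several noise rates would give a rough picture of how subset size and noise interact, which is the kind of question the abstract reports on.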

Similar resources

Evaluation of Classifiers in Software Fault-Proneness Prediction

The reliability of software depends on its fault-prone modules: the fewer fault-prone units the software contains, the more we may trust it. Therefore, if we are able to predict the number of fault-prone modules in a piece of software, it will be possible to judge its reliability. In predicting software fault-prone modules, one of the contributing features is software metric by which one ...

A Novel One Sided Feature Selection Method for Imbalanced Text Classification

Imbalanced data can be seen in various areas such as text classification, credit card fraud detection, risk management, web page classification, image classification, medical diagnosis/monitoring, and biological data analysis. Classification algorithms tend to favor the large class and might even treat the minority class data as outliers. The text data is one of t...

Improving Imbalanced data classification accuracy by using Fuzzy Similarity Measure and subtractive clustering

Classification is one of the important parts of data mining and knowledge discovery. In most cases, the data used to train the clusters is not well distributed. This uneven distribution occurs when one class has a large number of samples while the number of samples in the other class is inherently low. In general, the methods of solving this kind of prob...

Imbalanced Data SVM Classification Method Based on Cluster Boundary Sampling and DT-KNN Pruning

This paper presents an SVM classification method based on cluster boundary sampling and sample pruning. We explore an effective solution to the difficult problem of imbalanced dataset classification through data re-sampling and algorithm improvement. First, we propose the method of cluster boundary sampling, using the clustering density threshold and the boundary density ...

Improving SMOTE with Fuzzy Rough Prototype Selection to Detect Noise in Imbalanced Classification Data

In this paper, we present a prototype selection technique for imbalanced data, Fuzzy Rough Imbalanced Prototype Selection (FRIPS), to improve the quality of the artificial instances generated by the Synthetic Minority Over-sampling TEchnique (SMOTE). Using fuzzy rough set theory, the noise level of each instance is measured, and instances for which the noise level exceeds a certain threshold le...

Publication date: 2012